Abstract

Background: Tumor biomarkers are potentially useful in several ways, such as identifying individuals at
increased risk of developing cancer, screening for early malignancies, and aiding cancer diagnosis; they
may also be used for determining prognosis, predicting therapeutic response, surveillance following
curative surgery for cancer, and monitoring therapy. Epigenetic alterations, especially aberrant DNA methylation,
are recognized as common molecular alterations in a variety of tumors and also occur during tumor
development. The Cancer Grade Predictor (CGPredictor) is an extendable package with functions designed to facilitate
systematic, integrated, and rapid analysis of high-throughput methylation data. It clusters patients into the
subgroups with the most self-similar profiles and supports various validation tests with regard to survival
outcome in order to identify target predictors.

Results: We used high-grade serous ovarian cancer (HGSOC) and invasive breast carcinoma (BRCA) to demonstrate
the usefulness of the CGPredictor package. The clustering results and the identified predictors were statistically
significant in each of the validation tests applied, and the analyses ran efficiently. In addition, some of the
markers in both the HGSOC and BRCA marker panels have previously been reported as significant. Even when
validated on an independent, larger BRCA dataset generated on a different platform, the identified predictor
distinguished patient groups and produced significant results.

Conclusions: The CGPredictor package is not a customized analysis tool designed for identifying only one or a few
specific types of cancer; it can be applied more broadly. Moreover, the results indicate that the extracted
predictors may be worthy of further clinical testing to assess their potential usefulness for clinical molecular
diagnosis and targeted treatment of patients with HGSOC and BRCA. CGPredictor is therefore feasible for examining
the statistical significance of specific markers of interest and shows great potential for cancer biomarker
mining in other types of cancer.




Background 

DNA methylation has attracted a great amount of interest
in the field of cancer research and is currently considered
to be a common abnormality found during tumor initiation
and subsequent cancer progression [1-3]. DNA methylation
of CpG islands regulates gene expression patterns in
cancers [2,4]. Also, DNA hypermethylation of
promoter-associated CpG islands of tumor suppressor genes,
which leads to transcriptional silencing of these genes, has
been the most studied epigenetic alteration in human
neoplasia [4]. Methylation patterns and gene expression
profiles can be measured on a genome scale with microarrays,
which enables integration of these data for further
identification of genes that are crucial to cancer progression.

An early diagnosis is critical for the successful treatment
of many types of cancer. DNA methylation is closely
related to the development of cancer [5]. Since DNA
methylation occurs early and can be detected in body
fluids, it may be of use in the early detection of
tumors and for determining the prognosis of some
patients [1-3]. The potential to use DNA methylation to
determine a patient's prognosis, to predict therapeutic
response, for surveillance following curative surgery for
cancer, and to monitor affected critical genes presents
researchers with an attractive option for exploring the
clinical use of DNA methylation during the treatment of
malignancies. A preventive strategy is needed in which
biomarkers guide physicians in placing patients into
appropriate screening or surveillance programs for the early
detection of cancers. Hence, more reliable markers associated
with a large population base of tumors need to be developed
for widespread use in the diagnosis and treatment of
cancer. The primary goal of the CGPredictor package is to
identify and examine biomarkers derived from subgroups of
patients with highly self-similar profiles; the package can be
paired with various validation methods designed to
facilitate the identification of distinct phenotypes in a variety
of cancers.

To demonstrate the utility of CGPredictor, we analyzed
alterations in DNA methylation in two cancers: 282
patients with HGSOC [6] and 241 patients with
BRCA [7], using data from the Cancer Genome Atlas portal.
Tables 1 and 2 show the clinical characteristics of the
patients considered in this study. We believe CGPredictor
offers researchers the first systematic approach that supports
both the mining and the examination of cancer biomarker
candidates, followed by various validation analyses, and we
found it to be highly efficient (see Table 4).
For both HGSOC and BRCA patients, the
statistical significance of the predictor and of the clustering
genes could be examined; in addition, known cancer markers
could be identified in the predictors based on previous
reports in the literature.

Methods 

The use of CGPredictor requires several major steps. In
the clustering step, the "kmeans" function in the CGPredictor
package is used to cluster samples. In the biomarker
selection step, the user can set parameters to
choose hypermethylation/hypomethylation corresponding
to downregulated/upregulated intensity between the
clustered phenotypes. In the predictor performance
examination step, a Cox test is calculated on the clinical
outcome of the distinct clustered phenotypes, and a random
selection test can be performed for further validation
to increase confidence that the gene sets were not
selected by chance. Once the predictor is validated, a
bootstrap test is used to examine the significance of the
relationship between the clustering genes and the phenotypes.
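
The clustering and biomarker selection steps can be illustrated with a minimal Python sketch. It is not the CGPredictor implementation (the package itself is written in R). The function names `kmeans`, `cluster_means` and `select_candidates`, the `min_beta_diff` threshold, the farthest-first initialisation, and the toy beta and expression values are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(samples, k=2, max_iter=100):
    """Lloyd's k-means on rows of `samples`, with a deterministic farthest-first start."""
    centers = [list(samples[0])]
    while len(centers) < k:
        # Next center: the sample farthest from every center chosen so far.
        centers.append(list(max(samples, key=lambda s: min(sq_dist(s, c) for c in centers))))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(s, centers[c])) for s in samples]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_means(matrix, labels, cluster):
    """Per-gene mean over the samples assigned to `cluster`."""
    idx = [i for i, lab in enumerate(labels) if lab == cluster]
    return {g: sum(vals[i] for i in idx) / len(idx) for g, vals in matrix.items()}

def select_candidates(meth, expr, labels, pos, neg, min_beta_diff=0.2):
    """Genes hypermethylated and downregulated in cluster `pos` relative to `neg`."""
    m_pos, m_neg = cluster_means(meth, labels, pos), cluster_means(meth, labels, neg)
    e_pos, e_neg = cluster_means(expr, labels, pos), cluster_means(expr, labels, neg)
    return sorted(
        g for g in meth
        if g in expr                               # link the matrices by gene name
        and m_pos[g] - m_neg[g] >= min_beta_diff   # hypermethylated (hypothetical cutoff)
        and e_pos[g] < e_neg[g]                    # downregulated
    )

# Toy data (hypothetical): gene -> one value per patient, six patients.
meth = {"geneA": [0.1, 0.1, 0.2, 0.8, 0.9, 0.8],
        "geneB": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
        "geneC": [0.1, 0.2, 0.1, 0.9, 0.8, 0.9]}
expr = {"geneA": [5, 5, 5, 1, 1, 1],
        "geneB": [3, 3, 3, 3, 3, 3],
        "geneC": [1, 1, 1, 5, 5, 5]}

genes = sorted(meth)
samples = [[meth[g][i] for g in genes] for i in range(6)]
labels = kmeans(samples, k=2)

# Call the cluster with the higher mean beta value "CIMP-positive".
mean_beta = {c: sum(sum(s) for s, lab in zip(samples, labels) if lab == c) / labels.count(c)
             for c in (0, 1)}
pos = max(mean_beta, key=mean_beta.get)
candidates = select_candidates(meth, expr, labels, pos, 1 - pos)
print(labels, candidates)
```

In this toy example only geneA passes the filter: geneB shows no methylation difference, and geneC is hypermethylated but upregulated.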

First, the beta value matrix is used by the kmeans function
in CGPredictor to cluster together patients with the most
self-similar profiles. To extract the biomarker
candidates, gene names are used to link the methylation
and gene expression matrices. The mean
gene intensity in each cluster is then determined for both
gene expression and DNA methylation for subsequent
comparison of molecular intensity between the clustered
phenotypes. Next, the filter function in the CGPredictor
package can be used to obtain the biomarker candidates,
that is, the predictor genes whose
hypermethylation/hypomethylation corresponds to
downregulation/upregulation between phenotypes. The
CGPredictor functions for Kaplan-Meier (KM) curves and the
Cox test can then be used to estimate the performance of the
predictors from any significant differences in survival
observed between patient groups. To increase the level of
statistical confidence and for further validation of the
relationship found between clustering genes and the
phenotypes and of the significance of the predictor, bootstrap



Table 1 Characteristics of the HGSOC participants used in the analysis

                           O-CIMP-negative     O-CIMP-positive     Total
No. of Patients            81                  32                  113
Patient Phenotype
Age, years
  Median (LQ, UQ)          56 (50, 63)         60 (56, 70)         57 (51, 66)
  No. < 40 years old       6                   0                   6
Survival (in months)
  Median a (LQ, UQ)        20.9 (11.7, 34.6)   17.7 (6.8, 27.8)    20 (10.5, 30.9)
Sex
  Female                   81                  32                  113
  Male                     0                   0                   0

LQ, lower quartile; UQ, upper quartile; and CI, confidence interval.
a Median survival and corresponding confidence intervals were estimated from the Kaplan-Meier curve

Table 2 Characteristics of the BRCA participants used in the analysis

                           B-CIMP-negative     B-CIMP-positive     Total
No. of Patients            77                  12                  89
Patient Phenotype
Age, years
  Median (LQ, UQ)          57 (46, 66)         66.5 (61, 70)       59 (47, 68)
  No. < 40 years old       8                   1                   9
Survival (in months)
  Median a (LQ, UQ)        14.3 (7.6, 21.1)    14.5 (7.9, 18.8)    19.2 (10.7, 32.9)
Sex
  Female                   77                  12                  89
  Male                     0                   0                   0

LQ, lower quartile; UQ, upper quartile; and CI, confidence interval.
a Median survival and corresponding confidence intervals were estimated from the Kaplan-Meier curve



and random selection tests can be performed, respectively.
The relationship between clustering genes and the
distinct subtype of patients can be measured using the
bootstrap test. The bootstrap sample datasets are drawn from
the original cancer dataset by sampling with replacement,
with a default of 1,000 iterations. The
original clustering genes are used for kmeans clustering
in each resampled dataset, and the sensitivity is then
computed across the 1,000 resampled datasets to measure
statistical significance. Moreover,
the random selection test function randomly selects
the same number of genes as were originally
extracted as biomarker candidates for a specific cancer.
This function can be used to efficiently
test the significance of the extracted predictor, with the
same default of 1,000 iterations (see Table 4). The
programming structure of the CGPredictor functions is user
friendly. It allows for future extension of the procedure as
long as new packages follow the
recommended input and output data structures
of each CGPredictor function. CGPredictor
is also highly extensible: users can modify it with any
function that can be implemented in R. CGPredictor is
not limited to DNA methylation microarrays and is
scalable to various kinds of microarray analysis problems.
However, our integrated system is currently limited to Mac
and Windows operating systems and cannot be used on
Linux systems, for example.
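
As a rough illustration of the bootstrap test described above, the following Python sketch resamples patients with replacement, reclusters each resampled dataset, and reports the mean sensitivity. It is a sketch under stated assumptions rather than the package's code: the helper `two_means` (a one-dimensional two-cluster k-means applied to a per-patient mean beta value), the function name `bootstrap_sensitivity`, the definition of sensitivity as the fraction of originally positive patients that are again labeled positive, and the toy values are all hypothetical.

```python
import random

def two_means(values, max_iter=50):
    """1-D two-cluster k-means; label 1 marks the cluster with the higher center."""
    lo, hi = min(values), max(values)
    labels = None
    for _ in range(max_iter):
        new_labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        if new_labels == labels:
            break
        labels = new_labels
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        if low:
            lo = sum(low) / len(low)
        if high:
            hi = sum(high) / len(high)
    return labels

def bootstrap_sensitivity(values, n_iter=1000, seed=0):
    """Mean sensitivity of reclustering over bootstrap resamples of the patients."""
    rng = random.Random(seed)
    original = two_means(values)
    n = len(values)
    scores = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]       # sample with replacement
        relabeled = two_means([values[i] for i in idx])  # recluster the resample
        # New labels of the drawn patients that were originally positive.
        positives = [r for r, i in zip(relabeled, idx) if original[i] == 1]
        if positives:                                    # skip resamples without positives
            scores.append(sum(positives) / len(positives))
    return sum(scores) / len(scores)

# Toy per-patient mean beta values over the clustering genes (hypothetical).
mean_beta = [0.10, 0.12, 0.15, 0.11, 0.14, 0.80, 0.85, 0.90, 0.82, 0.88]
sensitivity = bootstrap_sensitivity(mean_beta)
print(round(sensitivity, 3))
```

A value close to 1 indicates that the original grouping is reproduced in nearly every resampled dataset; how this summary is converted into a p-value is not specified here.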

Measuring how confident one can be in the usefulness
of the extracted biomarker candidates is very important
in cancer biomarker mining. Aside from some basic
processing functions in our integrated system, the statistical
validation functions play a critical role in examining the
extracted biomarker candidates. Using CGPredictor, users can
measure their confidence in the relationship found between
the clustering features and the clustered phenotypes, and can
examine the quality and significance of the biomarker
candidates they extracted.
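
The random selection test can be sketched in the same spirit. The example below draws random gene sets of the same size as the predictor and estimates an empirical p-value as the fraction of random sets that score at least as well. The function name `random_selection_test`, the generic `score_fn` argument, the add-one correction, and the toy per-gene scores are assumptions for illustration; in practice the score would presumably come from the survival comparison of the resulting phenotypes.

```python
import random

def random_selection_test(all_genes, predictor, score_fn, n_iter=1000, seed=0):
    """Empirical p-value: how often a random gene set scores at least as well as the predictor."""
    rng = random.Random(seed)
    observed = score_fn(predictor)
    hits = 0
    for _ in range(n_iter):
        random_set = rng.sample(all_genes, len(predictor))  # same size as the predictor
        if score_fn(random_set) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)

# Toy per-gene methylation difference between two phenotypes (hypothetical values).
separation = {f"g{i}": 0.05 for i in range(20)}
separation.update({"g0": 0.7, "g1": 0.8, "g2": 0.9})

def mean_separation(genes):
    """Score of a gene set: mean per-gene separation."""
    return sum(separation[g] for g in genes) / len(genes)

p_value = random_selection_test(sorted(separation), ["g0", "g1", "g2"], mean_separation)
print(p_value)
```

With this estimator the smallest attainable p-value is 1/(n_iter + 1), so the number of iterations bounds the resolution of the test.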



Results 

Study population 

We used the CGPredictor package to analyze 282 HGSOC
and 241 BRCA patients profiled on the Infinium
HumanMethylation27K platform (Illumina Inc., San Diego, CA,
USA), which covers 27,578 CpG dinucleotides spanning about
14,000 genes; the data were accessed from the Cancer Genome
Atlas (TCGA) data portal. Furthermore, another large
independent dataset of 596 BRCA patients, profiled on a
different platform, HumanMethylation450k, was analyzed
with the proposed R package for validation. In
earlier work, the human embryonic stem cell (hESC) specific
gene panel was found
to be enriched in poorly differentiated tumors [8]. Based
on previous reports [8,9], we compiled related
hESC gene sets: ESC over-expressed genes [10]; Nanog,
Oct4 and Sox2 targets [11]; Polycomb targets in hESCs
[12]; and Myc targets [13,14]. The primary analysis
was then limited to the common gene set, a total of
3,800 genes.

High-grade serous ovarian cancer data analysis and 
various validations 

After kmeans clustering, the two extreme phenotypes,
which included the most normal tissues and the most
abnormal tissues, were labeled O-CIMP-negative and
O-CIMP-positive, respectively (O-CIMP: high-grade
serous ovarian cancer CpG island methylator phenotype).
Toyota et al. first characterized a CpG island methylator
phenotype (CIMP) in human colorectal cancer [15]. When
hypermethylated and downregulated genes in HGSOC were
retrieved, the 43 extracted genes (the predictor for
HGSOC) included SOX1, CALCA, DCC, GATA4, and
NID2, five genes known to be connected to
HGSOC. In addition to these five of the 43 biomarker
candidates having been reported as significant,
the KM curve and Cox test for the phenotype
distinction gave a p-value of 0.01647 (Figure 1). This
indicates that the distinct phenotypes clustered by the
extracted predictor differ significantly in survival.

Figure 1 The relationship between O-CIMP status and patient
outcome clustered by the predictor of HGSOC. KM survival
curves are shown for O-CIMP-positive (blue line) and
O-CIMP-negative (red line) patients. A distinct DNA methylation
phenotype was identified within HGSOC patients; significantly
better survival was observed for O-CIMP-negative patients
than for O-CIMP-positive patients.
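
To make the survival comparison concrete, the following Python sketch computes a Kaplan-Meier curve and a two-group log-rank test. The log-rank test is used here as a simple stand-in that is closely related to a Cox test on a single binary covariate; it is not necessarily the exact test implemented in CGPredictor. The function names and the toy survival times are assumptions for illustration.

```python
import math

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time; events: 1 = death, 0 = censored."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

def logrank_test(times, events, groups):
    """Two-group log-rank test; `groups` holds 0/1 labels. Returns (chi2, p)."""
    obs = exp = var = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)                              # at risk, both groups
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)  # at risk, group 1
        d = sum(1 for u, e in zip(times, events) if u == t and e)        # deaths at t
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if u == t and e and g == 1)                             # deaths at t, group 1
        obs += d1
        exp += d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0                       # no information to separate the groups
    chi2 = (obs - exp) ** 2 / var
    p = math.erfc(math.sqrt(chi2 / 2))        # chi-square upper tail, 1 degree of freedom
    return chi2, p

# Toy data (hypothetical): group 1 plays the role of the CIMP-positive phenotype.
times = [20, 25, 30, 35, 40, 2, 3, 4, 5, 6]
events = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]      # one censored patient in group 0
groups = [0] * 5 + [1] * 5

curve_pos = kaplan_meier(times[5:], events[5:])
chi2, p = logrank_test(times, events, groups)
print(curve_pos, chi2, p)
```

The per-group curves correspond to the lines drawn in a KM plot, and the test p-value plays the role of the significance value reported alongside the figure.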



Furthermore, the predictor for HGSOC was also significant
(p < 0.0001 after 1,000 iterations) in the random selection
test, in which randomly selected genes were used to examine
the significance of the extracted predictor. Bootstrapping
with 1,000 iterations was also statistically significant
(p < 0.0001), which verified the significance of the clustering
results. These results suggest that a predictor
extracted by the CGPredictor package and defined by DNA
methylation status can serve as an independent
predictor of cancer phenotype. The
usefulness of the predictor is therefore worth further
examination in future clinical testing.

Breast cancer data analysis and various validations 

We also considered the 241 BRCA patients for whom
DNA methylation, mRNA expression, and clinical record
datasets were available, as another way of validating the
usefulness of CGPredictor. Two distinct phenotypes,
B-CIMP-negative and B-CIMP-positive (B-CIMP: BRCA CpG
island methylator phenotype), were obtained after clustering.
Using the same process as for HGSOC, ten
genes were selected as predictors. Among these ten
genes, BMP6 and GSTP1 have previously been well
documented as exhibiting tumor-specific methylation
alterations. The difference between the two distinct
phenotypes was significant (p = 0.0075, Figure 2) according
to the Cox test function in CGPredictor. This result
indicates that the gene panel is a significant predictor of the
two distinct phenotypes in patients with BRCA.
Furthermore, both the bootstrap test function and the random
selection test produced significant results (p < 0.0001); the
former was used in BRCA to examine the




Figure 2 KM survival curve for the distinct BRCA phenotype.
Significantly better survival for B-CIMP-negative (red) patients
than for B-CIMP-positive (blue) patients was also observed
(p = 0.0075); the significance of the difference between
phenotypes was assessed using the predictor evaluated by
CGPredictor.



relationship between the clustering genes and the distinct
phenotypes, and the latter was used to examine the
significance of the extracted predictor against randomly
selected genes over 1,000 repetitions. The results show that
both the clustering produced by the clustering genes and
the extracted predictor for BRCA were significant.

Furthermore, in addition to the support from the various
validation analyses and the fact that some
biomarker candidates have previously been reported as
significant, we used another large independent
dataset generated on a different platform. Specifically,
HumanMethylation450k data from 596
BRCA patients were analyzed with the CGPredictor R package.
Table 3 shows the clinical characteristics of those patients.
The Cox test supported the identified predictor as a
feasible and significant (p = 0.01798) predictor that
could distinguish the two BRCA phenotypes
(Figure 3). The results indicate that the
CGPredictor package, when supported by the various
validation methods, could identify a reliable,
genome-scale, independent prognostic epigenetic
marker panel for cancer. Also, CGPredictor is not simply a
tool custom designed for a specific cancer; it
can be broadly applied in biomarker
mining for various types of cancer.

Discussion 

For analysis of the HGSOC and BRCA patient data, the
CGPredictor package was used to group patients with the
most self-similar profiles into
subgroups, which allowed the identification of 43 and 10
genes as predictors for HGSOC and BRCA, respectively.
Significant survival differences were seen in the two

Table 3 Characteristics of the BRCA participants used in the independent validation analysis

                           B-CIMP-negative     B-CIMP-positive     Total
No. of Patients            96                  108                 204
TCGA Patient Phenotype
Age, years
  Median (LQ, UQ)          55 (45, 66.25)      62 (54.7, 71)       60 (49, 68.25)
  No. < 40 years old       14                  6                   20
Survival (in months)
  Median a (LQ, UQ)        35.6 (17.5, 55.5)   20.3 (6.9, 46.3)    28.2 (12.9, 48.3)
Sex
  Female                   96                  105                 201
  Male                     0                   3                   3

LQ, lower quartile; UQ, upper quartile; and CI, confidence interval.
a Median survival and corresponding confidence intervals were estimated from the Kaplan-Meier curve



Table 4 The performance evaluation of the package CGPredictor

Dataset   Sample size                 Read raw data   Process   Bootstrap            Random selection
                                                                (1,000 iterations)   (1,000 iterations)
HGSOC     282 samples, 5640 probes    25 sec          8 sec     21 sec               463 sec
BRCA      241 samples, 3038 probes    16 sec          6 sec     16 sec               393 sec

Intel Core2 2.33 GHz, 2 GB memory, Windows XP



distinct phenotypes defined by DNA methylation status
(Figures 1 and 2). Previous reports have identified several
of the hypermethylated and downregulated genes retained by
the filter, including
SOX1, CALCA, DCC, NID2, and GATA4, as significant
HGSOC markers. As for the BRCA predictor, GSTP1
and BMP6 have both previously been reported
to be significantly related to the presence of BRCA.

Based on these results, to test whether the relationship
between the established clustering genes and the phenotypes
was significant, we used bootstrapping with 1,000




Figure 3 Kaplan-Meier survival curves comparing B-CIMP-positive
(red) and B-CIMP-negative (blue) patients in the independent
dataset generated on a different platform. The x-axis shows
time in weeks. Significant survival differences between the
phenotypes were demonstrated by the predictor extracted
through the CGPredictor package.



iterations; for both HGSOC and BRCA, the clustering
results were statistically significant.
The identified predictors for each type of cancer
were compared against randomly selected gene sets of the
same size as the extracted marker panel over
1,000 iterations. For both the bootstrap test and the
random selection test used here, the results were significant
(p < 0.0001). Moreover, the predictor for BRCA was
shown to be capable of indicating significant differences in
survival in a different, large independent
dataset generated using Infinium
HumanMethylation450 (Figure 3). These results indicate that the
extracted predictors and the clustering results withstood the
various validations in CGPredictor; the CGPredictor package
therefore has very good
potential for use in mining and examining independent
prognostic epigenetic marker panels for other cancers.

When retrieving hypermethylated and downregulated
genes indicative of HGSOC, the 43 selected genes
include five that have previously been reported to be
connected to HGSOC: SOX1, CALCA, DCC, GATA4,
and NID2. Sox domain proteins are a class of
developmentally important transcriptional regulators related to
the mammalian testis-determining factor SRY [16]. The SoxB1
group genes, Sox1, Sox2, and Sox3, are involved in
neurogenesis in various species, and only the overexpression
of Sox1 in cultured neural progenitor cells is
sufficient to induce neuronal lineage commitment [17]. The
methylation of SOX1 has been reported to be
correlated with the recurrence of ovarian cancer and with
overall survival of patients with ovarian cancer
[18]. GATA4 is expressed in most
organs and plays a critical role in their development [19].
GATA4 is initially expressed during
the formation of extraembryonic endoderm, which
differentiates from the pluripotent embryonic stem cells of the
inner cell mass during early embryonic development
[20], and is also expressed in human ovarian epithelial
cells [21,22]. However, GATA4 expression is often lost in
ovarian cancer cells [21,23]. The GATA4 gene is believed to
dictate distinct pathological pathways leading to serous
ovarian carcinomas [24]. Nidogen-2 (NID2) is a basement
membrane protein. The basement membrane plays an
important role in maintaining tissue organization and
compartmentalization [25]. Thus, either removal or
disruption of the basement membrane creates
an invasion-permissive environment, often promoting
cancer cell proliferation and invasion [26,27]. The loss of
nidogen expression has been shown to have a potential
pathogenic role in colon and stomach tumorigenesis [28].
NID2 has also been reported to be a biomarker for ovarian
cancer and to be closely correlated with
CA125 [29]. DCC (Deleted in Colorectal Carcinoma) is an
important tumor suppressor gene. It is a metastasis
suppressor gene that targets both proinvasive and survival
pathways in a cumulative manner in combination with
other genes [30]. A previous report indicated that 52% of
malignant ovarian cancers did not express the DCC gene and
suggested that a significant correlation exists between
DCC expression and ovarian cancer [31]. Finally,
methylation of the CALCA promoter was also informative for
differentiating the early stages of ovarian disease
from healthy controls [32].

In the related analysis, two well-known genes are among the
ten extracted biomarker candidates that form the predictor for
BRCA: BMP6 and GSTP1, which are involved in signal
transduction and cell detoxification, respectively.
These two genes are among the top ten hypermethylated
genes that have been identified and used to
distinguish between cancerous and normal tissues [33], and
different kinds of cohorts have been used for these purposes
[34]. Both papers [33,34] suggested that the genes might be
useful for developing epigenetic-based predictive
and prognostic biomarkers for breast cancer. A
previous study also tested samples from women with palpable
lesions suspicious for breast cancer for aberrant promoter
hypermethylation, and the GSTP1 candidate gene could be
easily detected in fine needle aspirate washings. Promoter
hypermethylation in benign and malignant lesions was
more commonly found in GSTP1 than in the other reported
candidate genes [35]. Another study determined the
frequency of aberrant methylation of the GSTP1 candidate
gene in primary breast cancer tissue from patients with
predominantly advanced cancers and suggested that GSTP1 is
potentially important in the early diagnosis of breast
cancer [36].

Conclusions 

The detection of cancer-specific alterations in DNA
methylation warrants further investigation because it
provides a potential benefit in the early diagnosis of cancer as
well as in the evaluation of the prognosis and therapeutic
responsiveness of patients. We developed an effective and
flexible tool for mining and examining predictors,
supported by systematic analysis. In addition to performing
the analysis efficiently, the CGPredictor package has a
variety of useful functions that can assist researchers in
examining the statistical significance of predictors or specific
genes of interest, as well as of clustering results. Given these
significant results, and the fact that some of the
identified markers have been reported previously in the
literature for both HGSOC and BRCA, our findings
provide further support for the idea that the CGPredictor
package has great potential for mining and examining
genome-scale, independent prognostic epigenetic marker
panels for various cancers, and also support the potential of
the retrieved predictors for future clinical testing.

Availability 

The CGPredictor package is implemented in R and is
freely available at http://goo.gl/DVqni. A vignette with
detailed descriptions of the functions and examples is
included.